Index

1. Introduction

2. Classification Models

    2.1. Linear Probability Model

    2.2. Linear Discriminant Analysis

    2.3. Quadratic Discriminant Analysis

    2.4. Logistic Regression

    2.5. Regularised Methods

    2.6. K Nearest Neighbour

    2.7. Generalised Additive Model

    2.8. Classification and Regression Trees

    2.9. Random Forest

3. Conclusion



1. Introduction

This report presents a statistical analysis of the wine quality dataset released by the Portuguese firm Vinho Verde, aiming to classify sample quality based on physical-chemical properties. To this end, various models and approaches are compared so as to identify the best one in terms of out-of-sample predictions.

The original data are split by wine typology, with 4898 white wine samples and 1599 red ones. However, the two sets are considered together in this report, and a variable type is added, with value 1 for white wines and 0 otherwise. All in all, the final dataset comprises \(n=6497\) statistical units, with 11 variables expressing physical-chemical qualities plus the additional feature type. The target variable is quality, which measures sample quality as an ordinal categorical variable on a scale from 1 to 10. Each of its values is obtained as the median of at least 3 votes expressed by expert sommeliers.



As we can see, the actual range of the target is the integer interval \([3,9]\). Moreover, categories 5 and 6 are the most numerous and alone account for 76.6% of the whole sample. This may be a problem during training, since the model could learn to predict only the most represented classes. For this reason, we balance the target variable by considering a binary version obtained as follows:

\[ \begin{equation} \textit{binary_quality} = \begin{cases} 1 &\textit{if quality} > 5 \\ 0 &\textit{otherwise} \end{cases} \end{equation}\]
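As a quick sketch of this binarisation (hypothetical quality values standing in for the real dataset; numpy rather than the analysis's own code):

```python
import numpy as np

# Hypothetical vector of quality scores; in the real data values fall in [3, 9]
quality = np.array([3, 5, 6, 7, 5, 8, 4, 6])

# binary_quality = 1 if quality > 5, else 0
binary_quality = (quality > 5).astype(int)

print(binary_quality)  # -> [0 0 1 1 0 1 0 1]
```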

The physical-chemical information available is contained in the following variables:

  • Fixed Acidity (fixed.acidity)
  • Volatile Acidity (volatile.acidity)
  • Citric Acid (citric.acid)
  • Residual Sugar (residual.sugar)
  • Chlorides (chlorides)
  • Free Sulphur Dioxide (free.sulfur.dioxide)
  • Total Sulphur Dioxide (total.sulfur.dioxide)
  • Density (density)
  • pH (pH)
  • Sulphates (sulphates)
  • Alcohol (alcohol)

As a first insight, the following plots illustrate the distributions of each variable in the two subsamples (red and yellow curves) and in the whole dataset (blue curve):




The features are mostly bell-shaped, except for total.sulfur.dioxide, citric.acid and chlorides, which show some signs of multimodality. Moreover, the variables have quite different ranges of variation, which suggests some pre-processing (e.g. standardisation). For this reason, we performed the analysis both with and without preliminary transformations of the original data, and found no significant differences between the two approaches. Hence, for convenience, only the results obtained without pre-processing are shown in the following.

Although correlation does not imply causation, it makes sense to explore the dependency structure between the target variable and each of the predictors.

For this purpose, the correlation matrix computed on the whole dataset is reported below:



The target quality (in its original format) is very poorly correlated with the predictors: in only two cases do we observe a linear correlation coefficient above 0.3 (alcohol and density). On the other hand, the predictors appear highly linked to one another. In particular, the strongest relationships we can observe are:

  • free.sulfur.dioxide - total.sulfur.dioxide (0.72)
  • density - alcohol (-0.687)
  • density - residual.sugar (0.553)

However, given the categorical nature of the variable quality, it is more appropriate to study its correlation structure with the Spearman coefficient rather than the Pearson one:



The Spearman coefficient, \(\rho\), measures the degree of association between two ordinal variables. Unlike the Pearson coefficient, Spearman's \(\rho\) assumes only a monotonic relationship between the variables (not necessarily a linear one).
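The difference can be illustrated in a few lines: Spearman's \(\rho\) is simply the Pearson correlation computed on the ranks. A pure-numpy sketch (simple ordinal ranks, ignoring the tie handling a full implementation would need):

```python
import numpy as np

def rank(v):
    # ordinal ranks 1..n (no average ranks for ties; fine for distinct values)
    order = np.argsort(v)
    r = np.empty(len(v))
    r[order] = np.arange(1, len(v) + 1)
    return r

def spearman(x, y):
    # Spearman's rho = Pearson correlation of the ranks
    return np.corrcoef(rank(x), rank(y))[0, 1]

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3  # monotonic but non-linear relationship

print(spearman(x, y))            # -> 1.0: perfect monotonic association
print(np.corrcoef(x, y)[0, 1])   # Pearson is below 1: the link is not linear
```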

For more details on the dataset check:

https://archive.ics.uci.edu/ml/datasets/wine+quality

The analyses presented in the following are conducted considering about 60% of the data as training set and the remaining 40% for testing.



2. Classification Models

In this section, we experiment with different models for the classification task, and then compare the results to pick the one with the best out-of-sample predictions. The input variables considered in the following are the 11 physical-chemical features plus the variable type, with value 1 for white wines and 0 for reds. The target variable is the binary version of quality obtained by assigning 0 to wines of poor quality (<=5) and 1 otherwise.

As a first attempt, we explore the class of linear methods, i.e. models that separate the classes of the output based on a linear function of the inputs (or a monotonic transformation thereof). We then move beyond linearity with Quadratic Discriminant Analysis (QDA), k-Nearest Neighbours (kNN) and Generalised Additive Models (GAM), followed by tree-based methods in the final part.



2.1. Linear Probability Model

The regression model defines the linear discriminant function, \(\delta(x)\), as the expected value of the target conditional on the observed data, \(E[Y|X]\). With a dichotomous outcome, this is equivalent to estimating the probability that an observation belongs to category 1, \(P(Y=1|X)\), hence the name linear probability model. However, \(E[Y|X]\) is by construction a real number, so it is poorly suited to estimating a quantity that must lie between 0 and 1. Nonetheless, it is possible to show that the sum of the predicted probabilities of belonging to each class is 1.

To see why this is true, consider the general case of a multinomial classification problem (G categories) based on p regressors. The LPM approach then consists of fitting G binary regression models simultaneously, one per class. The problem reduces to estimating the probability of belonging to each category and classifying the observation into the class that returns the highest response. In matrix form:

\[ \hat{Y}_{n \times G} = X_{n \times (p+1)} ({X}^{T} X)_{(p+1) \times (p+1)}^{-1} {X}_{(p+1) \times n}^{T} Y_{n \times G}\]

Post-multiplying by a G-dimensional column vector of ones, we can sum the elements of each row of \(\hat{Y}\):

\[\hat{Y}_{n \times G} \textbf{1}_{G \times 1} = X_{n \times (p+1)} ({X}^{T} X)_{(p+1) \times (p+1)}^{-1} {X}_{(p+1) \times n}^{T} Y_{n \times G} \textbf{1}_{G \times 1}\]

Now, observe that \(Y_{n \times G} \textbf{1}_{G \times 1} = \textbf{1}_{n \times 1}\), since each row of the indicator matrix \(Y\) contains exactly one 1. Moreover, as long as the model includes an intercept, \(\textbf{1}_{n \times 1}\) is a column of \(X\), and the hat matrix \(H = X ({X}^{T} X)^{-1} {X}^{T}\) reproduces every column of \(X\) exactly, i.e. \(HX = X\). It follows that:

\[\hat{Y}_{n \times G} \textbf{1}_{G \times 1} = H\, Y_{n \times G} \textbf{1}_{G \times 1} = H\, \textbf{1}_{n \times 1} = \textbf{1}_{n \times 1} = Y_{n \times G} \textbf{1}_{G \times 1}\]

Hence the sum of the predicted scores is 1 for all the observations. In fact, \(\hat{Y_1}+ ... + \hat{Y_G} = 1\) by construction.

Thus, although the predicted scores cannot be interpreted as probabilities, they do share this property with them.
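The sum-to-one property is easy to check numerically (a sketch with simulated data and an intercept column, numpy only; not the report's own code):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, G = 50, 3, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept included
Y = np.eye(G)[rng.integers(0, G, size=n)]  # indicator matrix: each row sums to 1

# One least-squares fit per class, all at once: Yhat = X (X'X)^{-1} X' Y
B = np.linalg.solve(X.T @ X, X.T @ Y)
Y_hat = X @ B

print(np.allclose(Y_hat.sum(axis=1), 1.0))  # -> True
```

Note that while each row of \(\hat{Y}\) sums to 1, the individual scores are not constrained to lie in \([0,1]\).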

Passing to the application, we fitted such a model using all of the initial regressors. As anticipated, it is easy to verify that, in general, the scores produced by such models are not guaranteed to lie between 0 and 1:



Notice that almost all of the coefficients are statistically significant at any reasonable significance level \(\alpha\). The only exceptions are chlorides and citric.acid, with p-values of 0.37 and 0.08 respectively.



Focusing on the coefficient estimates, the most relevant variable for the prediction appears to be density. However, the large magnitude of this coefficient (-36.12) is balanced by the intercept, both in terms of point estimate and variability. This behaviour is due to the feature having a very concentrated distribution that is highly correlated with the target: each slight variation in its value is associated with a significant change in the predicted outcome. This guess is confirmed when the analysis is repeated with standardised variables: the coefficient for density is then much more in line with the other estimates, and the most relevant features become volatile.acidity, alcohol and type (-0.15, 0.14 and -0.13 respectively). No changes in the patterns of significance are observed, though.



Once a score has been obtained, it is necessary to fix a cutoff so as to classify each observation into one of the two categories. This choice is not obvious, as the outcome of the model is not bounded in \([0,1]\). For this reason, the optimal threshold is chosen via a grid search that minimises the misclassification error. In particular, scanning cutoff values between 0 and 1 in steps of 0.01, the optimal threshold is 0.54.

Having used the test set to choose the optimal cutoff, we can no longer use it to evaluate the performance of the model without overestimating it due to overfitting. For this reason, the prediction error is estimated via Leave-One-Out cross-validation. In particular, for a linear regression with quadratic loss, this measure can be computed by fitting the model just once instead of \(n\) times:

\[ GCV = \frac{1}{n} \sum_{i=1}^{n} { \biggl[ \frac{ y_i - \hat{f}(x_i) } {1- \frac{trace(S)}{n}} \biggr]^2 } \]

where \(trace(S)\) represents the effective degrees of freedom of the model (in our case \(p+1=13\)).

Notice that, despite our loss not being quadratic, it is easy to show that the quadratic loss is equivalent to the misclassification error given the dichotomous nature of \(y_i\) and of the thresholded \(\hat{f}(x_i)\). Hence, the approach described above is justified.
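The one-fit shortcut above can be sketched in Python (simulated data standing in for the wine features; numpy only, not the report's R code):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept + p regressors
y = rng.integers(0, 2, size=n).astype(float)                # toy binary target

# Smoother matrix S = X (X'X)^{-1} X'; for OLS, trace(S) = p + 1
S = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = S @ y

# Generalised cross-validation: one fit instead of n leave-one-out fits
gcv = np.mean(((y - y_hat) / (1 - np.trace(S) / n)) ** 2)

# Exact leave-one-out shortcut for comparison (uses the leverages S_ii)
loo = np.mean(((y - y_hat) / (1 - np.diag(S))) ** 2)
print(round(np.trace(S)), gcv, loo)  # trace(S) equals the parameter count
```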

In conclusion, the accuracy obtained through the Generalised cross-validation is 0.7402 (instead of the 0.7488 computed in the test set, which is an optimistic estimate as expected). In terms of Receiver Operating Characteristic (ROC) curve, instead, the model has an Area Under Curve (AUC) of 0.8095.



2.2. Linear Discriminant Analysis

Linear discriminant analysis (LDA) is an alternative way of building a discriminant function. Indeed, rewriting \(P(Y=g|X)\) as \(\frac{f_g(x)\pi_g}{\sum_{l=1}^{G}f_l(x)\pi_l}\), it is possible to estimate the probability of each class from its density function (and its prior probability). Starting from this formulation, linear discriminant analysis assumes that the \(f_g(x)\) are multivariate Gaussian densities with a common covariance matrix.

Once the \(f_g(x)\) are estimated based on sample data, it is then possible to show that the discriminant function associated to each category turns out to be: \[\delta_g(x)=x^{T}\Sigma^{-1}\mu_g -\frac{1}{2}\mu_g^T\Sigma^{-1}\mu_g+\log{\pi_g}\] Finally, in order to classify observations into one of the classes we simply pick \(g_*=argmax_g\delta_g(x)\).
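The discriminant rule above can be sketched with numpy (simulated two-class Gaussian data; a didactic sketch, not the report's R code):

```python
import numpy as np

def lda_fit(X, y):
    classes = np.unique(y)
    priors = np.array([np.mean(y == g) for g in classes])
    means = np.array([X[y == g].mean(axis=0) for g in classes])
    # pooled (common) within-class covariance matrix
    Sigma = sum(np.cov(X[y == g].T) * (np.sum(y == g) - 1) for g in classes)
    Sigma = Sigma / (len(y) - len(classes))
    return classes, priors, means, np.linalg.inv(Sigma)

def lda_predict(X, classes, priors, means, Sinv):
    # delta_g(x) = x' Sigma^-1 mu_g - 0.5 mu_g' Sigma^-1 mu_g + log pi_g
    quad = np.einsum('gi,ij,gj->g', means, Sinv, means)
    delta = X @ Sinv @ means.T - 0.5 * quad + np.log(priors)
    return classes[np.argmax(delta, axis=1)]

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, 0], 1.0, (100, 2)),
               rng.normal([3, 3], 1.0, (100, 2))])
y = np.repeat([0, 1], 100)

fit = lda_fit(X, y)
acc = np.mean(lda_predict(X, *fit) == y)
print(acc)  # well-separated Gaussians: accuracy close to 1
```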

Passing to the application, the model reaches an accuracy of 0.7438 and an AUC of 0.8095.



Alternatively, the LDA approach can be rephrased as finding the linear combination of the regressors that best separates the categories of the target variable. In other words, we look for \(a^Tx\) such that the groups are internally as homogeneous as possible and as far apart as possible from each other. This means we want low variability within each class and high between-groups variance: \[\phi_*=argmax_{\textbf{a}}\frac{\textbf{a}^TB\textbf{a}}{\textbf{a}^TW\textbf{a}}\] where \(B\) is the between-groups covariance matrix and \(W\) the within-groups one.

Solving this optimisation problem we get: \[(W^{-1}B-\phi I_{G \times G})\textbf{a}=0\]

This is a homogeneous system of linear equations, so it admits non-trivial solutions if and only if \(\det(W^{-1}B-\phi I)=0\), i.e. if \(\phi\) is an eigenvalue of \(W^{-1}B\) and \(a\) is the corresponding eigenvector. Algebraically, this implies that we have at most \(G-1\) solutions, precisely as many as the rank of \(W^{-1}B\). As a consequence, there are no more than \(G-1\) discriminant directions. Since \(\phi\) is exactly the quantity we want to maximise, the best discriminant function is given by the eigenvector corresponding to the largest eigenvalue of \(W^{-1}B\). Namely, it is the surface that best separates one class from all the others.
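This eigen-decomposition view can be sketched with numpy on toy two-group data (scatter matrices in place of covariance matrices, which only rescales \(\phi\)):

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.vstack([rng.normal([0, 0], 1.0, (80, 2)),
               rng.normal([2, 1], 1.0, (80, 2))])
y = np.repeat([0, 1], 80)

mu = X.mean(axis=0)
W = np.zeros((2, 2))  # within-groups scatter
B = np.zeros((2, 2))  # between-groups scatter
for g in (0, 1):
    Xg = X[y == g]
    mg = Xg.mean(axis=0)
    W += (Xg - mg).T @ (Xg - mg)
    B += len(Xg) * np.outer(mg - mu, mg - mu)

# discriminant directions = eigenvectors of W^{-1} B
phi, V = np.linalg.eig(np.linalg.solve(W, B))
phi = np.real(phi)
order = np.argsort(phi)[::-1]
a = np.real(V[:, order[0]])  # direction of the largest eigenvalue

# with G = 2 groups, rank(W^{-1}B) = 1: exactly one non-zero eigenvalue
print(np.round(phi[order], 6))
```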

In the case of two categories, we have a single non-zero eigenvalue and, hence, a single discriminant direction. To better understand the results of this approach, the following plot displays the observations with respect to the linear combination \(a^Tx\), also called the canonical coordinate, with the linear decision boundary superimposed.



Notice that the two classes do not appear well separated. The reasonably good performance of the model is probably due to the dataset being unbalanced, with the model correctly classifying the most represented category, i.e. observations of class 1.

Comparing the results of the two LDA formulations, we observe that the vector \(a\) is, unexpectedly, different from the Coefficients of linear discriminants computed by the R routine. However, this is merely a matter of parameterisation: R uses a normalisation that makes the within-covariance matrix spherical, i.e.:

\[ \textbf{a}^TW\textbf{a} = \Psi_{diag} \]

Therefore, starting from the second formulation, it is possible to get the same coefficients of the function lda post-multiplying the vector \(\textbf{a}\) by the matrix \(\Psi^{-\frac{1}{2}}\).



2.3. Quadratic Discriminant Analysis

If we drop the assumption of homoscedasticity across groups, the discriminant analysis approach produces a quadratic discriminant function and, in turn, the separation surface is no longer linear.

Despite using a more complex separation surface, this model achieves worse performances than the LDA, with accuracy and AUC of 0.7322 and 0.8055 respectively.



2.4. Logistic Regression

Logistic regression is one of the most popular methods for classification problems. Its advantage is that it directly estimates the probability of the event of interest; once this is achieved, the classification is easily derived from that probability value.

More specifically, the estimates are obtained exploiting the logistic distribution:

\[P(Y=1|X=x)=\frac{\exp{(\beta_0+\beta^Tx)}}{1+\exp{(\beta_0+\beta^Tx)}}\]
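Unlike the LPM score, this quantity is a valid probability by construction. A minimal numpy sketch of the formula:

```python
import numpy as np

def logistic_prob(x, beta0, beta):
    """P(Y=1 | X=x) under the logistic model."""
    eta = beta0 + x @ beta
    return 1.0 / (1.0 + np.exp(-eta))  # always in (0, 1)

p = logistic_prob(np.array([1.0, 2.0]), 0.0, np.array([0.0, 0.0]))
print(p)  # -> 0.5: a zero linear predictor gives even odds
```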

The results for our problem are in line with the previous models. Performance is slightly worse, with an accuracy of 0.7388 and a test error of 0.2612. The same applies to the AUC (0.8088). The coefficient estimates are all highly significant, though, except for the variable chlorides.



2.5. Regularised Methods

The rationale behind these methods is to regularise the estimates of the parameters by shrinking their value towards zero. Among the different options for doing so, we consider three alternatives: ridge regression, logistic lasso and elastic-net. All of them make use of the tuning parameter \(\lambda\) that influences the effect of the penalisation, but they differ in the type of penalty introduced. The learning process is conducted via cross-validation on the training set, while the results reported below are computed on the test set.

The ridge regression uses a penalisation of type \(\ell_2\) and tries to solve the following optimisation problem:

\[\min_{(\beta_0,\beta)\in\mathbb{R}^{p+1}}-{\biggl[ \frac{1}{n} \sum_{i=1}^{n}{y_i(\beta_0+x_i^T\beta)-\log{(1+\exp{(\beta_0+x_i^T\beta))}}} \biggr] + \lambda ||\beta||_2^2}\]

This type of penalty shrinks the parameter values towards zero without ever setting them exactly to zero. On the contrary, using the \(\ell_1\) norm as a penalty, the lasso can set some of the coefficients precisely to zero, provided that \(\lambda\) is sufficiently large:

\[\min_{(\beta_0,\beta)\in\mathbb{R}^{p+1}}-{\biggl[ \frac{1}{n} \sum_{i=1}^{n}{y_i(\beta_0+x_i^T\beta)-\log{(1+\exp{(\beta_0+x_i^T\beta))}}} \biggr] + \lambda ||\beta||_1}\]

Finally, elastic-net builds a hybrid solution between the two above. In fact, it adopts the following double penalty ruled by the parameter \(\alpha\), thus combining the advantages of the previous approaches:

\[\min_{(\beta_0,\beta)\in\mathbb{R}^{p+1}}-{\biggl[ \frac{1}{n} \sum_{i=1}^{n}{y_i(\beta_0+x_i^T\beta)-\log{(1+\exp{(\beta_0+x_i^T\beta))}}} \biggr] + \lambda[ \alpha ||\beta||_1 + (1-\alpha)||\beta||_2^2]}\]

The optimal values for \(\lambda\) and \(\alpha\) are obtained by 10-fold cross-validation.
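In scikit-learn terms, the three penalties can be sketched as follows (synthetic data, not the report's R/glmnet code; here C plays the role of \(1/\lambda\) and l1_ratio that of \(\alpha\)):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=12, random_state=0)

models = {
    "ridge": LogisticRegression(penalty="l2", C=1.0, max_iter=5000),
    "lasso": LogisticRegression(penalty="l1", solver="saga", C=1.0, max_iter=5000),
    "enet":  LogisticRegression(penalty="elasticnet", solver="saga",
                                l1_ratio=0.05, C=1.0, max_iter=5000),
}
for name, m in models.items():
    m.fit(X, y)
    # the l1 component can zero out coefficients; the l2 one cannot
    print(name, "zeroed coefficients:", int((m.coef_ == 0).sum()))
```

In practice one would tune C (and l1_ratio) by cross-validation, as done in the report.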



Analysing the results, we observe that ridge regression reaches the highest accuracy among the three. However, all three methods perform worse than the plain logistic regression, i.e. without penalisation. This suggests that no regularisation is actually needed for the problem at hand. A further clue in favour of this thesis is given by the estimates of the shrinkage parameters, which are all close to zero. In particular, it is interesting to notice that the value picked for \(\alpha\) in the elastic-net approach is 0.05. This means that the two penalties are not very effective overall and that the one related to the ridge solution is the most relevant, consistent with the separate experiments. Accuracy-wise the results are very similar, with values of 0.7305, 0.7345 and 0.7195 respectively. The same considerations follow from the ROC curves, with AUC values of 0.80, 0.7995 and 0.7897.

Focusing on the coefficient estimates, we have a confirmation that ridge regression retains all of the parameters, as expected. On the contrary, the lasso performs feature selection through the coefficients set to 0, i.e. fixed.acidity, chlorides, density and type.



Likewise, elastic-net also performs parameter selection despite its very low penalisation coefficient. This time, however, only fixed.acidity and type are removed from the analysis.

In light of these considerations, and given the small differences in performance, we may want to select the lasso approach, since it reaches similar results with a simpler, more parsimonious model.



2.6. K Nearest Neighbour

K-Nearest Neighbours is a nonparametric classification algorithm whose only tuning parameter is K. It is based on two key principles: the Euclidean distance between observations and the "majority vote" rule for classification.

As for the choice of the neighbourhood size, we adopt 10-fold cross-validation to select the optimal K. Input data are standardised beforehand so as to avoid giving more importance to variables with larger scales. The optimal value obtained is \(K=23\).
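The tuning step can be sketched with scikit-learn (synthetic data in place of the wine features; the scaler sits inside the pipeline so the standardisation is re-fitted within each fold):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=12, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
search = GridSearchCV(pipe, {"knn__n_neighbors": list(range(1, 40, 2))}, cv=10)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```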

This model is also in line with previous results, with an accuracy of 0.7572 and an AUC of 0.8192.



2.7. Generalised Additive Model

GAMs are additive models that overcome the linearity constraint through the introduction of more flexible functions of the regressors. In the case of binary classification, such a model can be written as follows:

\[\log{ \biggl( \frac{P(Y=1|X)}{1-P(Y=1|X) } \biggr)}=\alpha + \sum_{k=1}^p{f_k(X_k)}\]

As far as our application is concerned, we consider smoothing splines as transformations of the input features. The optimal degrees of freedom for these functions are selected by 10-fold cross-validation, with a grid search over the range \([1,10]\). The best model turns out to be the one with 7 degrees of freedom, achieving an accuracy of 0.7735 and an AUC of 0.8305.



Nonetheless, the improvement brought by this greater complexity is minimal, so it may be worth simplifying the model in favour of a more parsimonious one. For this reason, one may prefer \(df=4\), which corresponds to a natural cubic spline and is subject to lower variability.



2.8. Classification and Regression Trees

The second family of methods we use falls into the CART (Classification And Regression Trees) category, which employs decision trees for classification and/or regression analyses.

In practice, during their construction these models identify a sequence of binary splits into non-overlapping regions \(R_m\), building up a tree structure made of branches and leaves (or terminal nodes), inside each of which the objective function is approximated by a constant.

\[f(x) = \sum_{m=1}^{M}c_mI(x\in R_m)\]

In our case, since this is a classification problem, we can define \(c_m=p_{mg}\) as the proportion of observations belonging to class g within node m and estimate it by:

\[\hat{p}_{mg}=\frac{1}{N_m}\sum_{x_i\in R_m}I(y_i=g)\]

Observations are then assigned to the majority class within node m.

As for growing the tree, the algorithm produces a sequence of nodes by exhaustively searching over all the partitions of each regressor, selecting at every step the optimal variable and split so as to minimise a suitably chosen error function (greedy search). Possible choices used in the literature for this purpose are:

  • misclassification error

  • Gini index

  • Shannon entropy

In what follows we adopt the Gini index as the loss function for growing the tree, because of its ability to discriminate between situations where the misclassification error is the same, preferring the split that leads to purer nodes:

\[Q_m(T)=\sum_{g\neq g'}\hat{p}_{mg}\hat{p}_{mg'}=\sum_{g=1}^G\hat{p}_{mg}(1-\hat{p}_{mg}) \]
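A minimal sketch of the node impurity defined above (numpy only):

```python
import numpy as np

def gini(labels):
    """Gini impurity of a node: sum_g p_g (1 - p_g)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * (1 - p)))

print(gini([1, 1, 1, 1]))  # -> 0.0   (pure node)
print(gini([0, 0, 1, 1]))  # -> 0.5   (maximally impure binary node)
print(gini([0, 0, 0, 1]))  # -> 0.375
```

Being strictly concave in the class proportions, the Gini index rewards splits that produce purer children even when the misclassification error is unchanged.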

The number of splits to perform is one of the model parameters to be optimised. The strategy most commonly suggested in the literature is to grow a very large tree under rather loose stopping rules and then prune it back according to a cost-complexity function:

\[C_\alpha(T) = \sum_{m=1}^{|T|}{N_m Q_m(T)} + \alpha |T|\] where:

  • \(T\) denotes a subtree of the maximal tree, \(T \subset T_{max}\), and \(|T|\) is the number of its leaves

  • \(N_m\) is the number of observations in region \(R_m\)

  • \(\alpha\) is the complexity-penalisation parameter

Passing to the application, the results of optimising this function are reported in the following plot, which shows the classification error on the training sample (in red) together with an estimate of the prediction error obtained by cross-validation (in blue).



The optimal value of \(\alpha\) turns out to be 0.0031 (26 terminal nodes). However, applying the empirical one-standard-error rule, it is possible to select an equivalent tree with only 9 leaves, which is therefore easier to interpret.
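This grow-then-prune strategy can be sketched with scikit-learn's cost-complexity pruning (synthetic data, not the report's R code; the \(\alpha\) grid comes from the fitted tree itself):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=12, random_state=0)

# sequence of effective alphas obtained by pruning a fully grown tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# pick the alpha with the best cross-validated accuracy
scores = [(a, cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                              X, y, cv=5).mean())
          for a in path.ccp_alphas]
best_alpha, best_cv = max(scores, key=lambda t: t[1])
print(round(best_alpha, 4), round(best_cv, 4))
```

The one-standard-error rule would then pick the largest \(\alpha\) whose CV score is within one standard error of the best.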



Looking at the sequence of splits of the tree built this way, we can observe that the most important variables are alcohol and volatile.acidity, since they are the only two that appear, alternately, in the first splits.

Following a slightly more formal approach, we can instead associate with each variable an actual measure of how important each regressor is in the construction of the tree. To do so, it suffices to measure, via cross-validation, the average decrease in the predictive accuracy of the model on a fold in which the values of one regressor at a time are randomly permuted across all the statistical units. In this way, the larger the increase in the error, the more important the variable. The result of this procedure is shown in the following figure:
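The permutation step can be sketched directly (scikit-learn tree on synthetic data; for simplicity the importance is measured on the training sample rather than on a CV fold):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=6,
                           n_informative=2, random_state=0)

model = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
base = model.score(X, y)

# shuffle one column at a time; the accuracy drop measures that variable's importance
importance = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])
    importance.append(base - model.score(Xp, y))

print(np.round(importance, 3))  # larger drop = more important variable
```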



As already observed, alcohol and volatile.acidity are among the most important variables (first and fourth respectively). However, this way we can also appreciate the role played by density and chlorides, even though they only appear in the final splits.

Once the tree was built, its performance could be evaluated in terms of accuracy (0.7478) and AUC (0.7794). The values obtained are good and only slightly below those of the GAM, while yielding a very simple and easily interpretable model.





2.9. Random Forest

Trees are a very simple yet powerful tool; however, they usually suffer from high variability, i.e. a small change in the data can lead to a very different sequence of splits. This is mainly because an error in one of the first splits propagates hierarchically through the whole structure below it. To overcome this drawback, bagging methods can be used: they create a collection of trees, a forest indeed, and obtain an estimate with reduced variability by averaging over them. Bagging consists of drawing \(B\) bootstrap subsamples, also called bags, of size \(n_b < n\) from the initial sample, fitting a different model on each of them, and validating it on the remaining \(n-n_b\) observations, the so-called Out Of Bag ones. The global predictive model is finally built by averaging across many models (trees, in this case) so as to reduce their variance.

In this perspective, a Random Forest is obtained by training a large number of trees grown under very loose criteria, so as to have hopefully limited bias despite a high variance, the latter being reduced through averaging:

\[\hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^B\hat{f}^{*b}(x) \]

Moreover, to prevent the trees from being highly correlated with one another (a feature that would limit the variance-reduction benefit of the bagging approach), the training of the RF also applies random sampling of the regressors. This way only a random subset of m < p covariates takes part in the growth of each tree, which further reduces the variability of the final model.

Given this description, Random Forests rely on two tuning parameters: the number \(M\) of regressors sampled and the number \(B\) of trees to build within the forest. In our case, these two parameters were tuned via an exhaustive grid search in which \(B\) varied from 1 to 500 and \(M\) from 1 to 12 (the maximum number of regressors). The random forests compared thus ranged from a very simple one (a single tree, \(B=1\), with a single variable, \(M=1\)) to a very complex one (five hundred trees, \(B=500\), and all the variables, \(M=12\)).
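A reduced version of this grid search can be sketched with scikit-learn's OOB score (much smaller grid for runtime, synthetic data; note that scikit-learn resamples the M features at every split rather than once per tree):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=12, random_state=0)

results = []
for B in (50, 100, 200):   # number of trees
    for M in (2, 4, 8):    # regressors sampled at each split
        rf = RandomForestClassifier(n_estimators=B, max_features=M,
                                    oob_score=True, random_state=0).fit(X, y)
        results.append((rf.oob_score_, B, M))

best_oob, best_B, best_M = max(results)
print(best_B, best_M, round(best_oob, 4))
```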

As for the variables considered, since random forests are based on building many trees, it is not possible to inspect the splits as with a single tree, but we can still assess the importance of the variables used. In this case, the variable importance depends on the reduction in error obtained by splitting on that variable, averaged over all the trees.



The results of this validation are presented in the figure above. As can be seen, exploiting the Out Of Bag observations it was possible to optimise these two parameters, choosing a combination corresponding to a forest with \(B=239\) trees and only \(M=4\) regressors.

The performance achieved by this model is very high, with an accuracy of 0.8225 and an AUC of 0.8935. Random forests thus emerge as the best model used, although less interpretable than the previous ones and, in particular, than the single tree.

Finally, the following plot reports the variable importance, this time computed using the OOB observations rather than cross-validation.

Once again, alcohol is the variable with the greatest relevance. This time, however, volatile.acidity comes second, closely followed by density.





3. Conclusion

In this report we carried out an analysis aimed at comparing several classification methods from a data-science perspective, i.e. taking the out-of-sample prediction error as the comparison metric.

In particular, the comparison was conducted by investigating the use of non-linear classification methods such as Generalised Additive Models, Classification And Regression Trees and Random Forests.

The table below reports, for each model, the threshold that maximises its performance on the test set, together with the corresponding accuracy, sensitivity, specificity and, finally, the AUC.



Considering model accuracy, the Random Forest performs best (0.8225), followed by the GAM (0.7725 for the model with the same edf for all regressors and 0.7678 for the one with edf estimated by the algorithm) and by the CART (0.7478).


Looking at the AUC instead, the comparative performance of the models remains unchanged, with areas under the curve of 0.8935, 0.83, 0.8261 and 0.7794 respectively.



In conclusion, the Random Forest performs best both in terms of accuracy and of AUC, followed by the two GAM models and, finally, by the CART. In this case, therefore, the introduction of greater complexity does lead to a clear improvement in predictive performance and is preferable despite the reduced interpretability.